54 research outputs found
On information plus noise kernel random matrices
Kernel random matrices have attracted a lot of interest in recent years, from
both practical and theoretical standpoints. Most of the theoretical work so far
has focused on the case where the data is sampled from a low-dimensional
structure. Very recently, the first results concerning kernel random matrices
with high-dimensional input data were obtained, in a setting where the data was
sampled from a genuinely high-dimensional structure---similar to standard
assumptions in random matrix theory. In this paper, we consider the case where
the data is of the type "information+noise." In other words, each
observation is the sum of two independent elements: one sampled from a
"low-dimensional" structure, the signal part of the data, the other being
high-dimensional noise, normalized to not overwhelm but still affect the
signal. We consider two types of noise, spherical and elliptical. In the
spherical setting, we show that the spectral properties of kernel random
matrices can be understood from a new kernel matrix, computed only from the
signal part of the data, but using (in general) a slightly different kernel.
The Gaussian kernel has some special properties in this setting. The elliptical
setting, which is important from a robustness standpoint, is less prone to easy
interpretation.
Comment: Published at http://dx.doi.org/10.1214/10-AOS801 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
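
To make the setting concrete, here is a minimal numerical sketch (not code from the paper): each observation is a low-dimensional signal plus high-dimensional noise scaled to affect but not overwhelm the signal, and one compares a Gaussian kernel matrix built from the noisy data with one built from the signal alone. The normalization and bandwidth below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 200, 400, 2  # n observations in dimension p, signal of dimension d

# Signal lives on a d-dimensional subspace; noise is high-dimensional,
# scaled so its norm is O(1) (an illustrative choice of normalization).
signal = np.zeros((n, p))
signal[:, :d] = rng.standard_normal((n, d))
noise = rng.standard_normal((n, p)) / np.sqrt(p)
X = signal + noise

def gaussian_kernel(Y, h=1.0):
    # K_ij = exp(-||Y_i - Y_j||^2 / (2 h^2))
    sq_norms = (Y ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * h ** 2))

K_full = gaussian_kernel(X)         # kernel matrix from information + noise
K_signal = gaussian_kernel(signal)  # kernel matrix from the signal alone

# The paper relates the spectrum of K_full to a kernel matrix computed from
# the signal part only (in general with a slightly modified kernel);
# compare the top of the two spectra.
print(np.linalg.eigvalsh(K_full)[-5:])
print(np.linalg.eigvalsh(K_signal)[-5:])
```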
Tracy--Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices
We consider the asymptotic fluctuation behavior of the largest eigenvalue of
certain sample covariance matrices in the asymptotic regime where both
dimensions of the corresponding data matrix go to infinity. More precisely, let
$X$ be an $n\times p$ matrix, and let its rows be i.i.d. complex normal vectors
with mean 0 and covariance $\Sigma_p$. We show that for a large class of
covariance matrices $\Sigma_p$, the largest eigenvalue of $X^*X$ is
asymptotically distributed (after recentering and rescaling) as the
Tracy--Widom distribution that appears in the study of the Gaussian unitary
ensemble. We give explicit formulas for the centering and scaling sequences
that are easy to implement and involve only the spectral distribution of the
population covariance, $n$ and $p$. The main theorem applies to a number of
covariance models found in applications. For example, well-behaved Toeplitz
matrices as well as covariance matrices whose spectral distribution is a sum of
atoms (under some conditions on the mass of the atoms) are among the models the
theorem can handle. Generalizations of the theorem to certain spiked versions
of our models and a.s. results about the largest eigenvalue are given. We also
discuss a simple corollary that does not require normality of the entries of
the data matrix and some consequences for applications in multivariate
statistics.
Comment: Published at http://dx.doi.org/10.1214/009117906000000917 in the
Annals of Probability (http://www.imstat.org/aop/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
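
As a sanity check on the flavor of this result, the following sketch simulates the white case $\Sigma_p = I$, where the centering and scaling reduce to the classical constants; this is only the simplest special case of the paper's general formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 200, 100, 500

# Centering and scaling in the white case (Sigma_p = I); the paper's
# general formulas also cover non-identity Sigma_p:
#   mu    = (sqrt(n) + sqrt(p))^2
#   sigma = (sqrt(n) + sqrt(p)) * (1/sqrt(n) + 1/sqrt(p))^(1/3)
mu = (np.sqrt(n) + np.sqrt(p)) ** 2
sigma = (np.sqrt(n) + np.sqrt(p)) * (1.0 / np.sqrt(n) + 1.0 / np.sqrt(p)) ** (1.0 / 3.0)

stats = np.empty(reps)
for r in range(reps):
    # n x p matrix with i.i.d. standard complex Gaussian entries (E|X_ij|^2 = 1)
    X = (rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))) / np.sqrt(2.0)
    lmax = np.linalg.eigvalsh(X.conj().T @ X).max()  # largest eigenvalue of X*X
    stats[r] = (lmax - mu) / sigma

# The rescaled largest eigenvalue should be approximately Tracy--Widom (TW2),
# whose mean is roughly -1.77.
print("empirical mean:", stats.mean(), " empirical sd:", stats.std())
```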
Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond
We place ourselves in the setting of high-dimensional statistical inference,
where the number of variables $p$ in a data set of interest is of the same
order of magnitude as the number of observations $n$. More formally, we study
the asymptotic properties of correlation and covariance matrices, in the
setting where $p/n\to\rho\in(0,\infty)$ for general population covariance. We
show that, for a large class of models studied in random matrix theory,
spectral properties of large-dimensional correlation matrices are similar to
those of large-dimensional covariance matrices. We also derive a
Marčenko--Pastur-type system of equations for the limiting spectral
distribution of covariance matrices computed from data with elliptical
distributions and generalizations of this family. The motivation for this study
comes partly from the possible relevance of such distributional assumptions to
problems in econometrics and portfolio optimization, as well as robustness
questions for certain classical random matrix results. A mathematical theme of
the paper is the important use we make of concentration inequalities.
Comment: Published at http://dx.doi.org/10.1214/08-AAP548 in the Annals of
Applied Probability (http://www.imstat.org/aap/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
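
A rough numerical sketch of the elliptical setting (the radius law, normalization, and population covariance below are arbitrary illustrations, not the paper's assumptions): each observation is a random scalar radius times a correlated Gaussian vector, and the spectrum of the sample correlation matrix is compared with that of the sample covariance matrix of exactly standardized data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 500  # p/n = 0.5: the high-dimensional regime

# Elliptical data: X_i = w_i * Sigma^{1/2} Z_i, with Z_i standard Gaussian
# and w_i a scalar radius independent of Z_i, normalized so E[w^2] = 1.
Z = rng.standard_normal((n, p))
w = np.sqrt(rng.chisquare(df=3, size=n) / 3.0)  # one arbitrary radius law
pop_sd = np.sqrt(np.where(np.arange(p) < p // 2, 1.0, 4.0))  # variances 1 and 4
X = (w[:, None] * Z) * pop_sd[None, :]

S = np.cov(X, rowvar=False)          # sample covariance matrix
d = 1.0 / np.sqrt(np.diag(S))
R = d[:, None] * S * d[None, :]      # sample correlation matrix

# Sample covariance of the *exactly* standardized data (true sds known here);
# the paper shows correlation matrices inherit the spectral behavior of such
# covariance matrices, so the two spectra should agree closely.
S_std = np.cov(X / pop_sd[None, :], rowvar=False)
print(np.percentile(np.linalg.eigvalsh(R), [5, 50, 95]))
print(np.percentile(np.linalg.eigvalsh(S_std), [5, 50, 95]))
```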
The spectrum of kernel random matrices
We place ourselves in the setting of high-dimensional statistical inference
where the number of variables $p$ in a dataset of interest is of the same order
of magnitude as the number of observations $n$. We consider the spectrum of
certain kernel random matrices, in particular $n\times n$ matrices whose
$(i,j)$th entry is $f(X_i'X_j/p)$ or $f(\|X_i-X_j\|^2/p)$, where $p$ is
the dimension of the data, and $X_i$, $X_j$ are independent data vectors. Here $f$ is
assumed to be a locally smooth function. The study is motivated by questions
arising in statistics and computer science where these matrices are used to
perform, among other things, nonlinear versions of principal component
analysis. Surprisingly, we show that in high dimensions, and for the models we
analyze, the problem becomes essentially linear, which is at odds with
heuristics sometimes used to justify the usage of these methods. The analysis
also highlights certain peculiarities of models widely studied in random matrix
theory and raises some questions about their relevance as tools to model
high-dimensional data encountered in practice.
Comment: Published at http://dx.doi.org/10.1214/08-AOS648 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
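
The following sketch illustrates the flavor of this "essentially linear" phenomenon under strong simplifying assumptions (i.i.d. standard Gaussian data, identity population covariance, f = exp). The surrogate matrix M below is a linear approximation in the spirit of this kind of analysis; the paper's actual statement has additional terms and precise conditions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 500
X = rng.standard_normal((n, p))   # i.i.d. N(0,1) data, Sigma = I

f = np.exp                        # smooth kernel function, applied entrywise
f0, fp0, fpp0, f1 = 1.0, 1.0, 1.0, np.e   # f(0), f'(0), f''(0), f(1) for exp

G = X @ X.T / p                   # Gram matrix of inner products X_i'X_j / p
K = f(G)                          # kernel random matrix K_ij = f(X_i'X_j/p)

# Linear surrogate, specialized to Sigma = I (tr(Sigma^2)/p^2 = 1/p):
#   M = (f(0) + f''(0)/(2p)) 11' + f'(0) XX'/p + (f(1) - f(0) - f'(0)) I
ones = np.ones((n, n))
M = (f0 + fpp0 / (2.0 * p)) * ones + fp0 * G + (f1 - f0 - fp0) * np.eye(n)

# In operator norm, K - M should be small relative to K itself, illustrating
# that the spectrum of the nonlinear kernel matrix is essentially linear here.
gap = np.linalg.norm(K - M, 2)
print("||K - M|| =", gap, "  ||K|| =", np.linalg.norm(K, 2))
```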
A rate of convergence result for the largest eigenvalue of complex white Wishart matrices
It has been recently shown that if $X$ is an $n\times N$ matrix whose entries
are i.i.d. standard complex Gaussian and $l_1$ is the largest eigenvalue of
$X^*X$, there exist sequences $m_{n,N}$ and $s_{n,N}$ such that
$(l_1-m_{n,N})/s_{n,N}$ converges in distribution to $W_2$, the Tracy--Widom
law appearing in the study of the Gaussian unitary ensemble. This probability
law has a density which is known and computable. The cumulative distribution
function of $W_2$ is denoted $F_2$. In this paper we show that, under the
assumption that $n/N\to\gamma\in(0,\infty)$, we can find a function $M(\cdot)$,
continuous and nonincreasing, and sequences $\tilde{\mu}_{n,N}$ and
$\tilde{\sigma}_{n,N}$ such that, for all real $s_0$, there exists an integer
$N(s_0,\gamma)$ for which, if $(n\wedge N)\geq N(s_0,\gamma)$, we have, with
$\tilde{l}=(l_1-\tilde{\mu}_{n,N})/\tilde{\sigma}_{n,N}$,
$$\forall s\geq s_0,\quad (n\wedge N)^{2/3}\,|P(\tilde{l}\leq s)-F_2(s)|\leq
M(s_0)\exp(-s).$$ The
surprisingly good 2/3 rate and qualitative properties of the bounding function
help explain the fact that the limiting distribution is a good
approximation to the empirical distribution of $l_1$ in simulations, an
important fact from the point of view of (e.g., statistical) applications.
Comment: Published at http://dx.doi.org/10.1214/009117906000000502 in the
Annals of Probability (http://www.imstat.org/aop/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
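
The abstract notes that $W_2$ has a known, computable density. One standard numerical recipe (not taken from the paper) evaluates $F_2$ through the Hastings--McLeod solution of the Painlevé II equation; the integration window and tolerances below are pragmatic choices.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.special import airy

# Tracy--Widom F_2 via the Hastings--McLeod solution of Painleve II:
#   q'' = s q + 2 q^3,  q(s) ~ Ai(s) as s -> +infinity,
#   F_2(s) = exp( - int_s^inf (x - s) q(x)^2 dx ).
# We integrate from the right, augmenting the ODE with
#   I1(s) = int_s^inf q^2 dx  and  I2(s) = int_s^inf (x - s) q^2 dx,
# which satisfy I1' = -q^2 and I2' = -I1.
s_right, s_left = 8.0, -8.0
ai, aip, _, _ = airy(s_right)     # Ai and Ai' at the right endpoint

def rhs(s, y):
    q, qp, i1, i2 = y
    return [qp, s * q + 2.0 * q ** 3, -q * q, -i1]

sol = solve_ivp(rhs, (s_right, s_left), [ai, aip, 0.0, 0.0],
                dense_output=True, rtol=1e-10, atol=1e-12)

def F2(s):
    return np.exp(-sol.sol(s)[3])

for s in (-3.0, -2.0, -1.0, 0.0, 1.0):
    print(f"F_2({s:+.0f}) = {F2(s):.6f}")
```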
Second order accurate distributed eigenvector computation for extremely large matrices
We propose a second-order accurate method to estimate the eigenvectors of
extremely large matrices, thereby addressing a problem of relevance to
statisticians working in the analysis of very large datasets. More
specifically, we show that averaging eigenvectors of randomly subsampled
matrices efficiently approximates the true eigenvectors of the original matrix
under certain conditions on the incoherence of the spectral decomposition. This
incoherence assumption is typically milder than those made in matrix completion
and allows eigenvectors to be sparse. We discuss applications to spectral
methods in dimensionality reduction and information retrieval.
Comment: Complete proofs are included on averaging performance
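
A minimal sketch of the idea as described here (not the paper's exact algorithm or conditions): sparsify a symmetric matrix entrywise, compute the top eigenvector of each sparsified copy, align signs, and average. The test matrix, signal strength, and sampling rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, q, copies = 300, 0.3, 50  # q = probability of keeping each entry

# Test matrix: a dominant eigenvector u plus symmetric Gaussian noise.
# A dense random u is incoherent, although the paper's assumption is
# milder and also allows sparse eigenvectors.
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
W = rng.standard_normal((n, n))
A = 3.0 * np.sqrt(n) * np.outer(u, u) + (W + W.T) / 2.0

estimates = []
for _ in range(copies):
    keep = rng.random((n, n)) < q
    keep = np.triu(keep) | np.triu(keep, 1).T   # symmetric sampling pattern
    A_sub = np.where(keep, A, 0.0) / q          # rescale by 1/q (unbiased in mean)
    vals, vecs = np.linalg.eigh(A_sub)
    v = vecs[:, -1]                             # top eigenvector of the subsample
    if estimates:
        v = v * np.sign(v @ estimates[0])       # resolve the sign ambiguity
    estimates.append(v)

v_bar = np.mean(estimates, axis=0)
v_bar /= np.linalg.norm(v_bar)
print("alignment with truth |<v_bar, u>| =", abs(v_bar @ u))
```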
Operator norm consistent estimation of large-dimensional sparse covariance matrices
Estimating covariance matrices is a problem of fundamental importance in
multivariate statistics. In practice it is increasingly frequent to work with
data matrices of dimension $p\times n$, where $p$ and $n$ are both large.
Results from random matrix theory show very clearly that in this setting,
standard estimators like the sample covariance matrix perform in general very
poorly. In this "large , large " setting, it is sometimes the case that
practitioners are willing to assume that many elements of the population
covariance matrix are equal to 0, and hence this matrix is sparse. We develop
an estimator to handle this situation. The estimator is shown to be consistent
in operator norm, when, for instance, we have $p\asymp n$ as $n\to\infty$. In
other words, the largest singular value of the difference between the estimator
and the population covariance matrix goes to zero. This implies consistency of
all the eigenvalues and consistency of eigenspaces associated to isolated
eigenvalues. We also propose a notion of sparsity for matrices that is
"compatible" with spectral analysis and is independent of the ordering of the
variables.
Comment: Published at http://dx.doi.org/10.1214/07-AOS559 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
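
A minimal sketch in the spirit of a sparsity-exploiting estimator, assuming hard entrywise thresholding at a level of order $\sqrt{\log p/n}$; the paper's actual estimator, threshold constant, and conditions may differ.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 400, 400

# Sparse population covariance: tridiagonal, 1 on the diagonal, 0.4 off it.
Sigma = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

S = np.cov(X, rowvar=False)               # sample covariance matrix
t = 2.0 * np.sqrt(np.log(p) / n)          # threshold of order sqrt(log p / n)
S_hat = np.where(np.abs(S) >= t, S, 0.0)  # hard entrywise thresholding

# Operator-norm (largest singular value) errors: thresholding should beat
# the raw sample covariance in this sparse, "large n, large p" setting.
print("raw error:        ", np.linalg.norm(S - Sigma, 2))
print("thresholded error:", np.linalg.norm(S_hat - Sigma, 2))
```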
- …